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ABSTRACT 

This  paper  briefly  reviews  some  of  the  recent  developments  in  software 
fault  tolerance  through  program  redundancy.  Language  constructs,  f \ 

system  architectures  and  design  methodologies  developed  or  proposed  7 ? * 

for  programs  containing  redundancy  for  error  tolerance,  are  reviewed.  I 


1.  INTRODUCTION 

Production  of  large-scale  error -free  programs 
is  a difficult  task.  Tolerance  of  software 
faults  means  that,  during  program  execution, 
the  damaging  effects  of  residual  design  errors 
are  contained  and  acceptable  results  are  con- 
tinuously generated  for  the  environment.  In 
recent  years  the  possibility  of  using  program 
redundancy  for  the  purpose  of  providing  soft- 
ware fault  tolerance  has  become  a subject  of 
growing  interest  [1-27].  A program  containing 
redundancy  for  the  tolerance  of  software  and/or 
hardware  faults  is  called  a fault-tolerant  pro- 
gram [6] . 

A fault-tolerant  program  designed  for  tolerating 
residual  design  errors  normally  contains  multi- 
ple routines  that  can  compute  the  same  function. 
If  these  routines  use  distinct  algorithms,  it  is 
less  likely  that  the  same  design  error  will 
exist  in  all  of  the  routines  that  compute  the 
function.  The  use  of  program  redundancy  for 
software  fault  tolerance  is  motivated  by  the 
present  situation  in  software  engineering: 

( 1 ) In  spite  of  recent  advances  in  program 
validation  and  design  methodology,  large-scale 
programs  are  often  put  into  operation  with 
residual  design  errors. 

(2)  Computing  systems  are  increasingly  em- 
ployed In  critical  environments  where  the  cost 
of  failure  is  extremely  high. 

( 3 ) Maintenance  and  modification  of  operation- 
al programs  are  practical  necessities  and  often 
introduce  new  errors. 

(4)  Hardware  reliability  and  economy  have  been 
**■  significantly  improved  in  recent  years,  sub- 
U stantially  increasing  the  freedom  in  using  extra 
hardware. 

Recent  developments  in  fault-tolerant  program- 
ming have  taken  place  in  several  areas:  (1) 

t This  work  was  supported  in  part  by  the  Joint 
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language  constructs  for  structuring  fault-tolerant 
programs,  (2)  system  architectures  for  pro- 
cessing fault-tolerant  programs,  (3)  design 
methodologies  for  fault-tolerant  programs. 

This  paper  is  a brief  overview  of  some  of  these 
developments. 

2.  CHARACTERISTICS  ' 

FAULT -TOLERANT  PROGRAMMING 

Program  redundancy  can  be  used  in  many 
different  forms  in  achieving  fault-tolerance.  In 
a well-structured  system  (e.g.  , hierarchically 
structured  as  discussed  in  [19,21,22]),  the  same 
schemes  for  incorporating  program  redundancy 
can  be  used  for  both  hardware  and  software 
fault  tolerance.  In  [12],  Horning  et  al.  identi- 
fied three  major  types  of  program  redundancy 
useful  for  achieving  error  tolerance:  (1) 

acceptance  tests  which  check  the  reasonableness 
or  acceptability  of  intermediate  results  during 
program  execution,  (2)  alternate  routines  which 
are  invoked  to  compute  the  same  objective 
function  in  different  ways  when  the  result  pro- 
duced by  the  primary  routine  is  rejected,  and 
(3)  recovery  routines  which,  when  invoked  on 
rejection  of  an  intermediate  result  by  an  ac- 
ceptance test,  restore  the  system  state  to  a 
"legitimate"  state  (where  a retry  or  execution 
of  the  rest  of  the  program  can  start). 

The  use  of  program  redundancy  is  not  without 
cost,  however.  Fault-tolerant  programs  have 
generally  a high  development  cost  and  a higher 
processing  cost  than  conventional  programs. 

One  way  of  reducing  the  development  cost  is  to 
develop  methodologies  and  tools  for  systematic 
incorporation  of  redundancy  in  a well-structured 
form.  Good  structuring  of  a fault-tolerant 
program  is  particularly  important  since  the  use 
of  program  redundancy  increases  the  program 
size. 

The  high  processing  cost  of  fault-tolerant  pro- 
grams is  due  to  the  increased  processor  time 
and  increased  storage  space  required  for  exe- 
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cuting  redundant  program  components.  For 
example,  to  provide  a recovery  capability  re- 
quires processor  time  and  storage  space  for 
recording  and  retaining  some  histories  of 
computation  (i.  e. , saving  of  system  states  at 
various  points  in  program  execution).  Exe- 
cution of  acceptance  tests  in  fault-tolerant  pro- 
grams can  also  be  a factor  contributing  to  a 
large  execution  time  [4,6].  System  architec- 
tures reviewed  in  section  4 aim  at  the  reduc- 
tion of  execution  time  and/or  storage  space. 

3.  LANGUAGE  CONSTRUCTS 
FOR  FAULT -TOLERANT  PROGRAMMING 

A language  construct  developed  by  Horning  et 
al. , called  a recovery  block  or  fault-tolerant 
block,  allows  incorporation  of  program  redun- 
dancy into  a block-structured  program  in  a 
well -structured  manner  [2,12,21].  A fault- 
tolerant  block  has  the  following  structure: 
ensure  T bj  Oj  else -by  02  . . . else -by  O 
else-error,  where  T denotes  the  acceptance 
test,  Oj  the  primary  object  block  and  O^ 

(1  <k  £ n)  the  alternate  object  blocks.  All  the 
object  blocks  in  a fault-tolerant  block  F com- 
pute the  same  or  approximately  the  same  ob- 
jective function.  The  acceptance  test  T is 
executed  on  exit  from  an  object  block  to  con- 
firm that  the  object  block  has  performed 
acceptably.  The  execution  of  an  acceptance 
test  results  in  either  an  acceptance  (i.  e. , con- 
firmation) or  a rejection.  If  accepted,  control 
exits  from  the  fault-tolerant  block.  If  the 
result  produced  by  an  object  block  is  rejected, 
the  next  alternate  object  block  is  entered. 

After  the  alternate  object  block  finishes  its 
computation,  the  acceptance  test  is  repeated. 

The  following  aspects  of  this  scheme  should  be 
noted:  (1)  the  primary  or  alternate  object 

blocks  can  contain,  nested  within  themselves, 
further  fault-tolerant  blocks;  (2)  execution  of 
the  acceptance  test  may  require  reference  to 
both  the  original  and  the  modified  values  of 
variables  non-local  to  the  object  block;  and  (3) 
it  is  not  necessary  that  every  block  in  a block- 
structured  fault-tolerant  program  be  a fault- 
tolerant  block.  This  flexibility  is  particularly 
important  since  it  allows  incorporation  of  pro- 
gram redundancy  only  where  desired. 

Before  an  alternate  object  block  is  entered,  the 
system  state  is  restored  to  the  state  that  existed 
just  before  entry  to  the  primary  object  block. 

Each  variable  that  was  assigned  a new  value  by 
the  rejected  execution  is  restored  to  the  original 
vr.lue  of  the  variable.  The  system  automatically 
performs  this  "assignment  reversal".  A state 
vector  that  contains  the  values  of  all  the  vari- 
ables (that  may  be  changed  by  the  object  blocks) 
is  saved  on  entry  to  a fault -tolerant  block.  If 
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assignment  reversal  is  later  necessary,  this 
state  vector  is  used  to  restore  the  earlier  vari- 
variable  values.  Note  that  the  programmer  is 
relieved  from  specifying  this  "standard"  re- 
covery action. 

However,  certain  components  of  a system  can- 
not be  restored  in  such  a manner.  For  exam- 
ple, operations  involving  file  access,  accounting 
operations,  or  operations  dependent  upon  a real- 
time clock  cannot  be  reversed  by  the  method 
used  for  automatic  reversal  of  assignments. 
Horning  et  al.  proposed  a language  construct, 
called  a recoverable  procedure,  for  the 
specification  of  non-standard  recovery  operations 
[12],  A recoverable  procedure  allows  the  pro- 
grammer to  specify  (1)  the  opterations  of  saving 
information  that  may  be  required  for  non-standard 
recovery,  (2)  the  normal  computation,  and  (3) 
the  non-standard  recovery  operations  themselves. 

Checking  the  validity  of  input  to  a program 
module  is  a common  programming  practice 
especially  in  real-time  applications  [11]  and 
two  approaches  to  the  incorporation  of  the  in- 
put check  into  a fault-tolerant  block  are  com- 
pared in  [17).  One  approach  is  to  place  before 
a fault-tolerant  block  F an  '"empty^  fault-tolerant 
block  Fe  which  contains  nothing  but  an  accep- 
tance test  Te  which  checks  the  validity  of  input 
to  F.  The  other  approach  is  to  extend  the 
acceptance  test  to  consist  of  two  parts,  the 
input  acceptance  test  T*  and  the  result  accep- 
tance test  T**-J  The  syntactic  structure  thus 
becomes:  ensure  on- entry  T*  on- exit  T®1  b£ 

Oj  else-by  02  •••  else-error.  The  input 
acceptance  test  T*  is  executed  "before"  the 
primary  object  block  Oj.  Both  approaches 
may  be  used  in  structuring  a fault-tolerant 
program. 

If  a loop  in  a fault -tolerant  program  contains 
a fault-tolerant  block  as  its  body,  then  the 
acceptance  test  is  entered  at  every  iteration 
to  confirm  the  result.  Sometimes  it  is  desir- 
able to  invoke  the  validation  process  less 
frequently  tha.;  at  every  iteration.  For  flexible 
invocation  of  the  acceptance  test,  an  extension 
of  the  fault-tolerant  block  was  proposed  to 
include  a predicate  so  that  when  the  predicate 
is  "false"  on  entry  to  the  fault-tolerant  block, 
only  the  primary  object  block  is  executed  with- 
out invoking  the  execution  of  the  acceptance 
test  [17].  A syntactic  structure  of  the  extended 
fault-tolerant  block  is:  ensure  when  < logical 

expression>  that  on-entry  T1  on -exit  T®-  bv 
Oj  else-by  O?  ...  else-error. 

For  real-time  applications,  a watchdog  timer 
which  is  set  on  entry  to  an  object  block  and 
signals  an  error  if  a preset  time  limit  is 
reached,  can  be  useful  [4,11,12,18].  Elmendorf 
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discussed  a set  of  (assembly)  language  primi- 
tives for  controlling  both  concurrent  execution 
of  several  alternates  and  multiple  executions 
of  the  same  routine  [6]. 

Exception  handling  is  a generic  term  referring 
to  detecting  and  responding  to  abnormal  or 
undesired  events  [10,  13,  19].  Thus  the  block- 
structured  fault-tolerant  programming  discussed 
above  can  be  viewed  as  a highly  structured 
form  of  exception  handling  in  which  error 
detection  normally  occurs  on  completion  of  an 
object  block  and  recovery  generally  involves 
complete  cleanup  of  the  contaminated  part  of 
the  system  by  a backup  to  an  earlier  state.  In 
contrast,  most  approaches  to  exception  handling 
aim  at  immediate  error  detection  (i.  e. , detec- 
tion during  execution  of  a statement/instruction) 
and  "efficient  patching"  of  the  system  from  an 
undesired  state  into  a legitimate  state  (e.  g. , 

PL/  1 ON-construct).  Using  the  language  con- 
structs designed  to  support  such  exception 
handling,  error  detection  and  recovery  can  be 
structured  in  a flexible  and  unrestricted  manner 
(e.  g.  , non-nested  structure).  However,  they 
impose  a greater  burden  on  the  programmer. 

It  seems  possible  to  realize  the  immediate 
error  detection  and  efficient  system  patching 
effect  to  a .large  extent  by  extensive  use  of 
recoverable  procedures  in  the  fault-tolerant 
block  scheme. 

When  a set  of  parallel  processes  interact,  each 
capable  of  self-checking  and  recovery,  an  error 
in  one  process  may  affect  a different  process 
with  which  it  has  interacted  [8,21,23,24],  For 
instance,  if  process  A has  produced  data  for 
process  B and  this  data  is  later  determined  to 
be  incorrect  by  an  acceptance  test  in  process 
A,  then  both  process  A and  process  B may 
have  to  be  restored  to  an  earlier  state. 

Russell  and  Bredt  showed  that  explicit  error 
handling  could  lead  to  complicated  program 
structure  even  in  the  simple  case  of  a producer- 
consumer  system  [23].  Randell  recognized  that 
uncontrolled  use  of  process  communication  along 
with  error  detection  and  recovery  could  lead  to 
a disastrous  avalanche  of  backup  activity  called 
the  domino  effect  [21].  In  order  to  control  the 
domino  effect  Randell  proposed  to  restrict  the 
interactions  of  parallel  processes  to  take  place 
inside  conversations  [21].  A conversation  is 
a fault-tolerant  block  which  spans  two  or  more 
processes  and  creates  a "boundary"  which 
process  interactions  may  not  cross.  On  entry 
to  a conversation,  the  state  of  each  process  is 
saved  and  if  any  process  fails  its  own  accep- 
tance test  at  the  end  of  the  conversation,  all 
the  processes  are  backed  up  to  the  beginning  of 
the  conversation.  Power  of  conversations  as 
tools  for  structured  fault-tolerant  parallel  pro- 


gramming remains  to  be  conclusively  demon- 
strated.. 

4.  SYSTEM  ARCHITECTURES  FOR 
PROCESSING  FAULT-TOLERANT  PROGRAMS 

Since  the  processing  of  fault-tolerant  programs 
involves  saving  of  state  vectors  and  examination 
of  saved  state  vectors  (for  validation  and  re- 
covery), the  processing  cost  can  be  reduced  by 
saving  state  vectors  in  a compact  form.  It  is 
useful  to  save  only  the  differences  between  suc- 
cessive state  vectors,  rather  than  to  save  the 
entire  state  vector  on  entry  to  each  fault -tolerant 
block.  The  processing  cost  can  also  be  reduced 
by  noting  that  the  varibles  local  to  the  object 
block  are  irrelevant  to  the  recovery  and  in  many 
cases,  only  a small  subset  of  the  non-local 
variables  are  modified  by  the  object  block.  To 
exploit  these  properties.  Horning’ et  al.  developed 
a scheme  called  the  recovery  cache  which  saves 
state  vectors  in  a compact  form  [2,  12].  The 
essence  of  this  scheme  is  to  record  the  original 
value  of  each  non-local  variable  together  with 
its  logical  address  in  a table  right  before  the 
variable  is  modified  for  the  first  time  in  a new 
block. 

Validation  of  results  and  saving  of  state  vectors 
contribute  to  an  increase  in  program  execution 
time.  When  any  non-trivial  validation  is  em- 
ployed or  large  numbers  of  non-local  variables 
are  modified  during  execution  of  object  blocks, 
the  program  execution  time  even  in  the  case  of 
normal  fault-free  operation  could  very  well  ex- 
ceed the  tolerable  limit  in  real-time  applications 
[4,6].  (It  is  indeed  expected  that  useful  accep- 
tance tests  may  frequently  be  quite  complex. ) 

As  a possible  solution,  an  approach  of  over- 
lapping object  block  execution  with  the  validation 
and  state  vector  saving  was  proposed  in  [14]. 

This  parallel  execution  approach  requires  a 
multiprocessor  system  in  which  (1)  histories  of 
the  main-stream  computation  are  kept  in  a 
compact  form,  and  (2)  interference  between 
processors,  particularly  storage  conflicts,  is 
negligible  or  absent.  A multiprocessor  system 
architecture  that  satisfactorily  meets  these  re- 
quirements by  employing  a novel  memory  organi- 
zation, named  a duplex  memory,  was  sketched 
in  [14]  and  fully  described  in  [15].  The  duplex 
memory  also  offers  compact  storage  for  state 
vectors  but  in  a different  way  than  the  recovery 
cache.  Performance  evaluation  of  this  system 
based  on  a stochastic  model  confirmed  high 
effectiveness  of  the  parallel  system  in  reducing 
the  execution  time  increase  due  to  validation 
and  state  vector  saving  [16]. 

There  are  other  systems  already  developed  or 
under  development  which  support  the  structured 
use  of  program  redundancy  for  error  detection 


and  recovery,  e.  g.  , ESS  No.  2 [4],  System  250 
[22],  PRIME  [7],  HYDRA  [25],  etc.  These 
systems  do  not  employ  special  hardware  com- 
ponents unlike  the  systems  discussed  above. 

EES  No.  2 uses  extensive  redundancy  in  data 
structures  and  periodically  calls  in  an  audit 
program  (i.  e. , a collection  of  validation  tests) 
to  check  the  consistency  of  the  data  structures 
and  the  integrity  of  data.  Thus  audit  programs 
in  this  system  are  loosely  coupled  with  applica- 
tion programs.  Yau  et  al.  proposed  a software 
subsystem,  called  the  system  monitor,  that 
interfaces  with  the  operating  system  on  one  side 
and  with  application  programs  cn  the  other  side 
[27],  The  system  monitor  runs  in  a more 
privileged  protection  domain  than  the  application 
programs  and  is  responsible  for  safe  execution 
of  user-supplied  error  detection  and  recovery 
codes  designed  to  ensure  acceptable  performance 
of  the  application  programs.  In  System  250, 
the  software  subsystem  is  hierarchically  struc- 
tured and  each  level  contains  its  own  set  of 
software  checks  t(for  error  detection)  and  re- 
covery procedures.  If  the  recovery  function 
at  a level  fails,  then  the  recovery  at  a lower 
level  (i.  e.  , a level  closer  to  the  system  nu- 
cleus) takes  over,  analogous  to  a hierarchy  of 
recovery  functions  built  in  multi-level  nested 
fault-tolerant  blocks.  PRIME  is  a multipro- 
cessor time-sharing  system  and  uses  the 
technique  of  dynamic  decision  verification  which 
means  that,  every  time  the  operating  system 
running  on  one  processor  makes  a decision 
there  is  a followup  consistency  check  that  uses 
a different  algorithm  and  runs  on  a different 
processor.  HYDRA  is  a kernel  (nucleus) 
operating  system  for  C.mmp  multiprocessor' 
syst«m  and  is  composed  of  modules  each 
capable  of  input  check,  acceptance  test  and 
error  report.  For  important  data  structures, 
their  descriptors  (i.  e. , type  and  size  infor- 
mation) are  stored  in  redundant  forms. 


5.  OTHER  DEVELOPMENTS  AND 
UNRESOLVED  ASPECTS 

Progress  has  also  been  made  in  other  aspects 
of  fault-tolerant  programming.  Some  efforts 
have  been  made  to  develop  effective  design 
methodologies  for  fault-tolerant  programs. 
Anderson  proposed  a proof-guided  methodology 
for  designing  acceptance  tests  [1].  He  asserted 
that  the  use  of  acceptance  tests  should  simplify 
proofs  of  adequacy  (weaker  than  correctness), 
and  in  turn,  the  attempts  to  construct  proofs 
should  provide  indications  of  where  acceptance 
tests  could  usefully  be  employed.  Kopetz  argued 
for  the  usefulness  of  an  alternate  object  block 
which  is  designed  to  provide  minimally  accep- 
table service  to  the  rest  of  the  program  [18]. 

He  then  proposed,  for  efficient  implementation 
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and  debugging,  to  start  program  development 
with  the  construction  of  acceptance  tests  and 
alternate  object  blocks  performing  minimally 
acceptable  functions  and  then  to  proceed  to  the 
construction  of  primary  object  blocks.  Parnas 
and  Wiirges  identified  several  basic  types  of 
exceptional  conditions  and  discussed  methods  of 
structuring  inter -level  exception  handling  in 
hierarchically  organized  systems  [19].  Russell 
was  concerned  with  fault-tolerant  inter -process 
communication  that  is  both  domino-free  and 
efficient  [24].  He  considered  systems  of  parallel 
processes  which  were  constructed  by  using 
primitives  such  as  MARK  for  saving  system 
states,  RESTORE  for  restoring  the  system 
state  and  PURGE  for  discarding  no  longer  use- 
ful system  state  information,  and  then  studied 
conditions  under  which  systems  using  these 
primitives  can  be  guaranteed  not  to  be  subject 
to  any  possible  domino  effect.  Gilb  reported 
that  the  approach  of  dual  coding  (i.  e.  , developing 
two  different  versions  of  a program  with  the 
same  specification  by  two  independent  program- 
mers or  the  same  programmer  working  in 
different  languages)  had  been  highly  cost-effec- 
tive in  many  software  projects  [9].  Successful 
application  of  this  approach  to  obtaining  alter- 
nates in  fault-tolerant  programs  w.as  also  re- 
ported. Some  error  detection  and  recovery 
techniques  used  in  practice  were  reviewed  by 
Yau  and  Cheung  [26]  and  also  in  [20]. 

Denning  summarized  a broad  range  of  principles 
of  structuring  operating  systems  to  make  un- 
authorized actions  impossible,  to  confine  errors 
to  the  immediate  contexts  of  their  occurrences, 
and  to  detect  and  correct  damage  before  it 
spreads  [5].  He  also  stressed  the  need  for 
considerable  hardware  support  in  efficient  im- 
plementation of  those  principles. 

Fault-tolerant  programming  is  still  in  its  in- 
fancy. Only  limited  experience  with  fault - 
tolerant  programs  has  been  obtained,  although 
highly  optimistic  projections  have  come  out  of 
the  limited  experience  [2,4,9,11,18].  Much 
more  work  is  needed  in  many  areas,  among 
them  the  following: 

( 1 ) the  development  of  system  aids  which  are 
useful  at  other  phases  of  the  program  develop- 
ment cycle  - specification,  validation,  analysis, 
maintenance; 

(2)  the  development  of  additional  language 
constructs,  particularly  for  fault-tolerant 
parallel  programming; 

( 3 ) the  development  of  design  methodologies 
for  fault-tolerant  programs,  i.  e. , for  acceptance 
tests  and  alternates; 

(4)  the  establishment  of  a theoretical  foundation 
for  fault-tolerant  programming. 
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Further  experience  with  fault -tolerant  program- 
ming will  provide  additional  insight  into  the 

directions  of  future  work  and  into  the  full 

potential  of  fault -tolerant  programmming. 
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